How Reliable Are RAG Models?

A new paper by Wu et al. (2024) explores the balance between the information retrieved by Retrieval-Augmented Generation (RAG) models and the internal knowledge of large language models (LLMs).

The study, which focuses on GPT-4 and other LLMs in question-answering tasks, finds that when correct information is retrieved, the model's accuracy improves significantly (up to 94%).

When documents contain wrong information, and the model’s internal knowledge (its "prior") is weak, the model is more likely to repeat the incorrect details. However, if the model has a strong internal understanding, it is better at resisting bad data.

The paper also points out that the further the retrieved information strays from the model's internal knowledge, the less likely the model is to accept it.

As many companies use RAG systems in real-world applications, this research highlights the need to evaluate the risks when LLMs are exposed to supporting, contradicting, or false information.